Phillip (Armand) Bester is a medical scientist, researcher, and lecturer at the Division of Virology, University of the Free State, and National Health Laboratory Service (NHLS), Bloemfontein, South Africa
Andrie de Vries is the author of “R for Dummies” and a Solutions Engineer at RStudio
In part 2 of this series, we discussed the PhyloPi pipeline for conducting routine HIV phylogenetics in the drug resistance testing laboratory as a part of quality control. As mentioned, during HIV replication the error-prone viral reverse transcriptase (RT) converts its RNA genome into DNA before it can be integrated into the host cell genome. During this conversion, the enzyme makes random mistakes in the copying process. These mistakes or mutations can be deleterious, beneficial or may have no measurable impact on the replicative fitness of the virus. This fast rate of mutation provides enough divergence to be useful for phylogenetic analysis. As we will see in the 4th part of this series, the intrapatient divergence is smaller than the interpatient divergence. As infections spread from person to person the virus will continue to mutate and become more and more divergent. This allows us to use the genetic information we obtain while doing the drug resistance test and analyse the sequences for abnormalities.
Cuevas et al. (2015) published work on the in vivo rate of HIV evolution. Their analysis revealed the highest mutation rate of any biological entity of \(4.1 \cdot 10^{-3} (sd=1.7 \cdot 10^{-3})\). However, error-prone RT is not the only mechanism of mutation. One defence against HIV infection is an enzyme called apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like or APOBEC. As the name suggests these are enzymes which act on RNA and converts or mutates cytidine to uridine. In other words C to U which is the impact on RNA or C to T with DNA. The letter U in RNA is synonymous to T in DNA. Also, shown by Cuevas et al, these enzymes are not equally active in all people.
So how does this affect the virus in a negative way?
We first need to understand how RNA is translated into proteins. Below is a table showing the codon combinations for each of the 20 amino acids.
Amino acid encoding. Available at https://www.biologyjunction.com/protein-synthesis-worksheet/
As can be seen from the table above, some amino acids are encoded by more than one codon. For example, if we change the codon, CGU to AGA the resulting amino acid stays Arginine or R. This is referred to as a silent mutation since the resulting protein will look the same. On the other hand, if we mutate AGU to CGU the resulting mutation is from Serine to Arginine or in single letter notation, S to R. A change in the amino acid is referred to as a non-synonymous mutation. In reality, the APOBEC enzyme recognizes specific RNA sequence motifs, but just to give an idea of how this works, let’s look at an example.
Load some packages
library(ape)
library(tibble)
library(tidyr)
library(dplyr)
library(knitr)
library(plotly)
library(RColorBrewer)
library(diagram)
WT <- c("CGA", "GUU", "AUA", "GAG", "UGG", "AGU")
We have the sequence CGAGUUAUAGAGUGGAGU which we created in the cell block above as codons for clarity. We can now translate this sequence using the codon table or some function.
AA <- trans(as.DNAbin.DNAString(gsub("U", x = paste0(WT, collapse = ""), replacement = "T" )))
AA <- as.character.AAbin(AA)[[1]]
The code block above translated our RNA sequence into a protein sequence: R, V, I, E, W, S
Now let’s mutate all occurrences of C to U/T
MUT <- gsub("C", "U", WT)
The resulting mutant sequence is: UGA, GUU, AUA, GAG, UGG, AGU and if we now translate that we get
AA <- trans(as.DNAbin.DNAString(gsub("U", x = paste0(MUT, collapse = ""), replacement = "T" )))
AA <- as.character.AAbin(AA)[[1]]
the protein sequence: *, V, I, E, W, S.
The * means a stop codon was introduced. Stop codons are responsible for terminating translation from RNA to protein. If one of the viral genes has a stop codon in it the protein will truncate will be dysfunctional.
In the previous seqtion we showed the general principle of a msa. In biology sequence alingments are used to look at similarties of DNA or protein sequences. For most phylogenetic analysis a mutliple sequence alingment is a requirement and the more accurate the msa, the more accurate the phylogenetic inference. There are different ways of displaying a msa. Lets explore some of these options.
Read in the multiple sequence alignment file.
# Read in the alignment file
aln <- read.dna('example.aln', format = 'fasta')
Next we can calculate the distance matrix using the Kimura 2-parameter (K80) model. There are various models which can be applied when looking at DNA substitution models, but first lets look at Marcov chains.
tmA <- matrix(c(0.25,0.65,0.1,.25,0.25,.5,.35,.25,0.4),nrow = 3, byrow = TRUE)
stateNames <- c("No Rain","Light Rain","Heavy Rain")
row.names(tmA) <- stateNames; colnames(tmA) <- stateNames
tmA %>%
kable(
caption = "Probabilities of weather transitions"
)
| No Rain | Light Rain | Heavy Rain | |
|---|---|---|---|
| No Rain | 0.25 | 0.65 | 0.1 |
| Light Rain | 0.25 | 0.25 | 0.5 |
| Heavy Rain | 0.35 | 0.25 | 0.4 |
This example is borrowed from RPubs, thank you Janpu Hou for a very clean and clearly explained blog post on the subject.
Next, we can plot our probabilities from above.
plotmat(tmA,pos = c(1,2),
lwd = 1, box.lwd = 2,
cex.txt = 0.8,
box.size = 0.1,
box.type = "circle",
box.prop = 0.5,
box.col = "light blue",
arr.length=.1,
arr.width=.1,
self.cex = .6,
self.shifty = -.01,
self.shiftx = .14,
main = "Markov Chain")
As you can see from the table and the diagram above, we have three states (nodes) in our example and the probability of transition or staying in a stage is indicated by the edges (lines).
A big part of a scientist’s job is to observe nature and then try to apply a model to the observation. Think of Issac Newton and Albert Einstein, both of them had a very good model for gravity. Arguably, Newton’s model is easier to implement when you want to calculate a trajectory for launching a rocket and Einsteins model can explain why black holes bent light. It is time for a quote:
“All models are wrong, but some are useful” - George Box
This is very true when it comes to estimating genetic distances and phylogentic inference. Consider the image below.
transversions vs transitions. Available at https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/All_transitions_and_transversions.svg/1024px-All_transitions_and_transversions.svg.png
The figure above shows transition and transversion events. Transition between A and G (the purines) and C and T (the pyrimidines) are more likely than transversions indicated by the red arrows. The K80 model takes this into account as one of its parameters and these rates or probabilities are calculated or estimated by maximum likelihood.
Lets see what that looks like.
tmDNA <- matrix(c(0.8,0.05,0.1,0.05,
0.05,0.8,0.05,0.1,
0.1,0.05,0.8,0.05,
0.05,0.1,0.05,0.8),
nrow = 4, byrow = TRUE)
stateNames <- c("A","C","G", "T")
row.names(tmDNA) <- stateNames; colnames(tmDNA) <- stateNames
tmDNA %>%
kable(
caption = "Example K80 probabilities of transitions or transversions"
)
| A | C | G | T | |
|---|---|---|---|---|
| A | 0.80 | 0.05 | 0.10 | 0.05 |
| C | 0.05 | 0.80 | 0.05 | 0.10 |
| G | 0.10 | 0.05 | 0.80 | 0.05 |
| T | 0.05 | 0.10 | 0.05 | 0.80 |
plotmat(tmDNA,pos = c(2,2),
lwd = 1, box.lwd = 2,
cex.txt = 0.8,
box.size = 0.1,
box.type = "circle",
box.prop = 0.5,
box.col = "light blue",
arr.length=.1,
arr.width=.1,
self.cex = .6,
self.shifty = -.01,
self.shiftx = .14,
main = "Markov Chain")
The example above is a made up one but should explain the concept of a substitutions model. The viral reverse transcriptase is not a random sequence generator, but it does make mistakes. Most of the time when it is copying the RNA into DNA the base (state) stays the same. Then also the probability of a transversion vs a transition is different. If you look at the figure above where we introduced transversion and transition you will notice, that A is more similar to G and T is more similar t C in its chemical structure.
We can use the ape package and do the distance calculations using the K80 model.
# Calculate the genetic distances between sequences using the K80 model, as.mattrix makes the rest easier
alnDist <- dist.dna(aln, model = "K80", as.matrix = TRUE)
alnDist[1:5, 1:5] %>%
kable(caption = "First few rows of our distance matrix")
| 01_AE.JP.AB253686_INT | B.US.HM450245_INT | B.AU.AF407664_INT | B.CN.KJ820110_INT | B.RU.HM466986_INT | |
|---|---|---|---|---|---|
| 01_AE.JP.AB253686_INT | 0.0000000 | 0.0935626 | 0.0961965 | 0.0962887 | 0.0962887 |
| B.US.HM450245_INT | 0.0935626 | 0.0000000 | 0.0378446 | 0.0378167 | 0.0378748 |
| B.AU.AF407664_INT | 0.0961965 | 0.0378446 | 0.0000000 | 0.0454602 | 0.0494138 |
| B.CN.KJ820110_INT | 0.0962887 | 0.0378167 | 0.0454602 | 0.0000000 | 0.0479955 |
| B.RU.HM466986_INT | 0.0962887 | 0.0378748 | 0.0494138 | 0.0479955 | 0.0000000 |
The matirx has a shape of 47 by 47, we just preview the first 5 rows and columns.
The pipeline mentioned uses the Basic Local Alignment Search Tool (BLAST) to retrieve previously sampled sequences and adds these retrieved sequences to the analysis. BLAST is like a search engine you use on the web, but for protein or DNA sequences. By doing this important sequences from retrospective samples are included which enables PhyloPi to be aware of past sequences and not just batch per batch aware. Have a look at the paper for some examples.
The data we have is ready to use for heatmap plotting purposes, but since the data also contains previously sampled sequences, comparing those sequences amongst themselves would be a distraction. We are interested in those samples but only compared to the current batch of samples analysed. The figures below should explain this a bit better.
A diagram of a heatmap with lots of redundant and distracting data.
From the image above you can see that, typical of a heatmap, it is symmetrical on the diagonal. We show submitted vs retrieved samples in both the horizontal and vertical direction. Notice also, as annotated as “Distraction”, are the previous samples compared amongst themselves. We are not interested in those samples now as we would already have acted on any issues then. What we want instead, is a heatmap as depicted in the image below.
A diagram of a more focussed heatmap with the redundant and distracting data removed.
Luckily for us, we have a very powerful tool at our disposal, R, and plenty of really useful and convenient packages, like dplyr.
alnDistLong <-
alnDist %>%
as.data.frame(stringsToFactors = FALSE) %>%
rownames_to_column(var = "sample_1") %>%
gather(key = "sample_2", value = "distance", -sample_1, na.rm = TRUE) %>%
arrange(distance)
Create a new variable, combined, we will paste the names for sample_1 and sample_2 together
Final cleanup and removal of distracting data
# get the names of samples originally in the fasta file used for submission
qSample <- names(read.dna("example.fasta", format = "fasta"))
# compute new order of samples, so the new alignment is in the order of the heatmap example
sample_1 <- unique(alnDistLong$sample_1)
new_order <- c(sort(qSample), setdiff(sample_1, qSample))
Plot the heatmap using plotly for interactivity
alnDistLong %>%
filter(
sample_1 %in% qSample,
sample_1 != sample_2
) %>%
mutate(
sample_2 = factor(sample_2, levels = new_order)
) %>%
plot_ly(
x = ~sample_2,
y = ~sample_1,
z = ~distance,
type = "heatmap", colors = brewer.pal(11, "RdYlBu"),
zmin = 0.0, zmax = 0.03, xgap = 2, ygap = 1
) %>%
layout(
margin = list(l = 100, r = 10, b = 100, t = 10, pad = 4),
yaxis = list(tickfont = list(size = 10), showspikes = TRUE),
xaxis = list(tickfont = list(size = 10), showspikes = TRUE)
)
Above we used the package ape to calculate the genetic distances for the heatmap.
Another way of looking at our alignment data is to use phylogenetic inference. The PhyloPi pipeline saves each step of phylogenetic inference to allow the user to intercept at any step. We can use the newick tree file (a text file formatted as newick) and draw our own tree.
tree <- read.tree("example-tree.txt")
plot.phylo(
tree, cex = 0.8,
use.edge.length = TRUE,
tip.color = 'blue',
align.tip.label = FALSE,
show.node.label = TRUE
)
nodelabels("This one", 9, frame = "r", bg = "red", adj = c(-8.2,-46))
We have highlighted a node with a red block, with the text “This one” which we can now discuss. We have three leaves in this node, KM050043, KM050042, KM050041 and if you would look up these accession numbers at NCBI you will notice the Title of the paper it is tied to:
“HIV transmission. Selection bias at the heterosexual HIV-1 transmission bottleneck”
In this paper, the authors looked and selection bias when the infection is transmitted. They found that in a pool of viral quasispecies transmission is biased to benefit the fittest viral quasispecies. The node highlighted above shows the kind of clustering one would expect with a study like the one mentioned above. You will also notice plenty of other nodes which you can explore using the accession number and searching for it at https://www.hiv.lanl.gov/components/sequence/HIV/search/search.html